Real-time 3D Semantic Scene Completion Via Feature Aggregation and Conditioned Prediction
Semantic Scene Completion (SSC) aims to simultaneously predict the volumetric
occupancy and semantic category of a 3D scene. In this paper, we propose a
real-time semantic scene completion method with a feature aggregation strategy
and a conditioned prediction module. Feature aggregation fuses features with
different receptive fields and gathers context to improve scene completion
performance. The conditioned prediction module adopts a two-step prediction
scheme that takes volumetric occupancy as a condition to enhance semantic
completion prediction. We conduct experiments on three recognized benchmarks:
NYU, NYUCAD, and SUNCG. Our method achieves competitive performance at a speed
of 110 FPS on one GTX 1080 Ti GPU.
Comment: Accepted by ICI
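To make the two ideas concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a feature-aggregation block that fuses 3D features computed with different receptive fields, and a two-step conditioned head that predicts occupancy first and uses it to condition the semantic prediction. Module names, dilation rates, channel sizes, and the concatenation-based conditioning are all illustrative assumptions.

```python
# Minimal sketch, assuming dilated 3D convolutions for multi-receptive-field
# aggregation and concatenation-based conditioning; not the authors' code.
import torch
import torch.nn as nn


class FeatureAggregation(nn.Module):
    """Fuse 3D features computed with different receptive fields (dilations)."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv3d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4)  # assumed dilation rates
        )
        self.fuse = nn.Conv3d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ctx = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(ctx)


class ConditionedPrediction(nn.Module):
    """Two-step head: occupancy first, then semantics conditioned on occupancy."""

    def __init__(self, channels: int = 64, num_classes: int = 12):
        super().__init__()
        self.occupancy_head = nn.Conv3d(channels, 1, kernel_size=1)
        self.semantic_head = nn.Conv3d(channels + 1, num_classes, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        occ_logit = self.occupancy_head(feat)                 # volumetric occupancy
        occ_prob = torch.sigmoid(occ_logit)                   # condition signal
        sem_logit = self.semantic_head(torch.cat([feat, occ_prob], dim=1))
        return occ_logit, sem_logit


if __name__ == "__main__":
    feat = torch.randn(1, 64, 60, 36, 60)                     # B, C, D, H, W voxel grid
    occ, sem = ConditionedPrediction(64, 12)(FeatureAggregation(64)(feat))
    print(occ.shape, sem.shape)
```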
Conditional DETR V2: Efficient Detection Transformer with Box Queries
In this paper, we are interested in Detection Transformer (DETR), an
end-to-end object detection approach based on a transformer encoder-decoder
architecture without hand-crafted postprocessing, such as NMS. Inspired by
Conditional DETR, an improved DETR with fast training convergence, which
presented box queries (originally called spatial queries) for internal decoder
layers, we reformulate the object query into the format of the box query that
is a composition of the embeddings of the reference point and the
transformation of the box with respect to the reference point. This
reformulation indicates the connection between the object query in DETR and the
anchor box that is widely studied in Faster R-CNN. Furthermore, we learn the
box queries from the image content, further improving the detection quality of
Conditional DETR while retaining its fast training convergence. In addition, we adopt
the idea of axial self-attention to save the memory cost and accelerate the
encoder. The resulting detector, called Conditional DETR V2, achieves better
results than Conditional DETR, reduces memory cost, and runs more efficiently.
For example, for the DC-ResNet- backbone, our approach achieves AP with FPS
on the COCO set and, compared to Conditional DETR, it runs faster, saves % of
the overall memory cost, and improves the AP score.
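As a concrete illustration of the box-query formulation, the sketch below (an assumption, not the released Conditional DETR V2 code) composes each query from a sinusoidal embedding of a learnable reference point plus a linear embedding of a box transformation (offsets and size) relative to that point; the additive composition, the sinusoidal embedding, and all sizes are illustrative choices.

```python
# Minimal sketch of "box queries", assuming an additive composition of a
# reference-point embedding and a box-transformation embedding.
import math
import torch
import torch.nn as nn


def sine_embed(xy: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding of normalized (x, y) reference points -> (N, dim)."""
    quarter = dim // 4
    freqs = torch.exp(torch.arange(quarter, dtype=torch.float32)
                      * (-math.log(10000.0) / quarter))
    x = xy[:, 0:1] * freqs                       # (N, dim/4)
    y = xy[:, 1:2] * freqs
    return torch.cat([x.sin(), x.cos(), y.sin(), y.cos()], dim=1)


class BoxQuery(nn.Module):
    """Compose a box query from a reference point and a box transformation."""

    def __init__(self, dim: int = 256, num_queries: int = 300):
        super().__init__()
        self.dim = dim
        # Reference points could be predicted from image content; here they are
        # plain learnable parameters to keep the sketch self-contained.
        self.ref_points = nn.Parameter(torch.rand(num_queries, 2))
        # Box transformation w.r.t. the reference point: (dx, dy, w, h).
        self.box_transform = nn.Parameter(torch.zeros(num_queries, 4))
        self.box_proj = nn.Linear(4, dim)

    def forward(self) -> torch.Tensor:
        point_embed = sine_embed(self.ref_points.sigmoid(), self.dim)
        box_embed = self.box_proj(self.box_transform)
        return point_embed + box_embed           # (num_queries, dim) decoder queries


if __name__ == "__main__":
    print(BoxQuery()().shape)                    # torch.Size([300, 256])
```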
Recursive Generalization Transformer for Image Super-Resolution
Transformer architectures have exhibited remarkable performance in image
super-resolution (SR). Owing to the quadratic computational complexity of
self-attention (SA) in the Transformer, existing methods tend to adopt SA within a
local region to reduce overheads. However, the local design restricts the
global context exploitation, which is crucial for accurate image
reconstruction. In this work, we propose the Recursive Generalization
Transformer (RGT) for image SR, which can capture global spatial information
and is suitable for high-resolution images. Specifically, we propose the
recursive-generalization self-attention (RG-SA). It recursively aggregates
input features into representative feature maps, and then utilizes
cross-attention to extract global information. Meanwhile, the channel
dimensions of attention matrices (query, key, and value) are further scaled to
mitigate the redundancy in the channel domain. Furthermore, we combine the
RG-SA with local self-attention to enhance the exploitation of the global
context, and propose the hybrid adaptive integration (HAI) for module
integration. The HAI allows the direct and effective fusion between features at
different levels (local or global). Extensive experiments demonstrate that our
RGT outperforms recent state-of-the-art methods quantitatively and
qualitatively. Code is released at https://github.com/zhengchen1999/RGT.
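A simplified sketch of the RG-SA idea is given below (an illustrative reading of the abstract, not the code released at the repository above): the input is recursively downsampled into a small representative feature map, the full-resolution tokens cross-attend to it, and the query/key/value channel dimensions are scaled down by a factor c_r to mimic the channel reduction described above.

```python
# Minimal sketch, assuming strided depth-wise convolutions for the recursive
# aggregation and a standard multi-head cross-attention; not the released code.
import torch
import torch.nn as nn


class RGSA(nn.Module):
    """Cross-attention from full-resolution tokens to a recursively aggregated map."""

    def __init__(self, dim: int = 64, num_heads: int = 4,
                 num_recursions: int = 2, c_r: int = 2):
        super().__init__()
        reduced = dim // c_r
        # Recursive aggregation: repeated strided depth-wise convolutions that
        # shrink the input into a representative feature map.
        self.aggregate = nn.Sequential(*[
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1, groups=dim)
            for _ in range(num_recursions)
        ])
        self.to_q = nn.Linear(dim, reduced)      # query channels scaled by 1/c_r
        self.to_k = nn.Linear(dim, reduced)      # key channels scaled by 1/c_r
        self.to_v = nn.Linear(dim, reduced)      # value channels scaled by 1/c_r
        self.proj = nn.Linear(reduced, dim)
        self.num_heads = num_heads
        self.scale = (reduced // num_heads) ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        rep = self.aggregate(x)                              # representative map
        tokens = x.flatten(2).transpose(1, 2)                # (B, H*W, C)
        rep_tokens = rep.flatten(2).transpose(1, 2)          # (B, h'*w', C)

        def heads(t):                                        # (B, N, C') -> (B, heads, N, C'/heads)
            return t.reshape(b, t.shape[1], self.num_heads, -1).transpose(1, 2)

        q = heads(self.to_q(tokens))
        k = heads(self.to_k(rep_tokens))
        v = heads(self.to_v(rep_tokens))
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, h * w, -1)
        out = self.proj(out)                                 # back to `dim` channels
        return out.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    x = torch.randn(1, 64, 48, 48)
    print(RGSA(64)(x).shape)                                 # torch.Size([1, 64, 48, 48])
```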